NRG-GY003 WES Analysis

Whole Exome genomic analysis of NRG-GY003 ovarian cancer tumors

Genomic analysis of tumor exomes from patients with recurrent ovarian cancer treated with immune checkpoint inhibitors (NIVO or IPI/NIVO) treated on the NRG-GY003 clinical trial.
Author
Affiliation

Kelsey Monson, PhD, MS

Icahn School of Medicine at Mount Sinai

Published

July 31, 2025

Keywords

Genomics, Whole Exome, Immunotherapy, R, Ovarian Cancer

Intro

Variant Filtering Strategy

I first used the Nextflow Sarek pipeline to align the raw fastq files. Below is the workflow I used:

Note that, at this time, I have only merged the variants from the SNP/indel callers, as ASCAT and the other SV/CNV callers do not have .vcfs as their output.

Here is the detailed variant filtering strategy I used to come up with the MAF file used in this analysis:

  • First find consensus somatic calls:

    • Present in >=2 somatic callers (Freebayes, Mutect2, Strelka2)
    • Include variants annotated with PASS
      • PASS: indicates the variant passed all quality filters applied by the variant caller

      • Considered the gold-standard for high-quality variants passing the variant caller’s built-in quality control filters

    • Further refine PASS variants
      • Minimum tumor Read Depth = 20: ensures sufficient coverage (i.e. evidence) to confidently call tumor variants from normal and/or distinguish from random sequencing errors

      • Allele frequency 0.05 < AF < 0.95:

        • >0.05: also helps avoid random sequencing errors/miss-called variants

        • <0.95: helps avoid germline contamination

  • Identify rare pathogenic germline variants:

    • Present in >=2 germline callers (Freebayes, Haplotypecaller, Strelka2)

    • Identify those most likely to be pathogenic

      • Limit to protein-coding variants (drop all intergenic and non-coding variants)

      • Must be annotated with HIGH impact

    • Eliminate common variants

      • “Rare” variants are typically those present in <1% of the population

      • Because I was worried about missing variants, I initially excluded those with >5% population frequency, but this was too liberal.

      • I revised the script to set the threshold to 1% and will re-generate the MAF file eventually

      • For the current analysis, I QC’d the variants and only identified likey oncogenic variants in BRCA1/2, so limited the germline analysis to those genes.

    • Include all likely oncogenes regardless of population frequency (“BRCA1” “BRCA2” “TP53” “PTEN” “ATM” “CHEK2” “PALB2” “APC” “MUTYH”)

  • Generated consensus MAF file

    • The above analysis generated two vcfs, one with the processed somatic variants and one with likely pathogenic germline variants

    • These variants were annotated as SOMATIC or GERMLINE_PATHOGENIC, respectively in the vcfs

    • I then concatenated these variants to one MAF file for the entire sample using the Nextflow vcf2maf pipeline.

This consensus MAF is what is used for the downstream analysis presented here.

Loading in the data

Packages Used
# Load packages

library(maftools) # For majority of maf file analysis
library(mclust)
library("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)
library(NMF) # For signature calculation
library(pheatmap) # For nice heatmaps
library(ghibli) # For pretty colors

Load in raw MAF file and the clinical data

laml = read.maf(maf="input/merged_consensus_all_samples_germline.maf", 
                clinicalData ="input/NRG-GY00_laml_annot_2.csv",
                rmFlags = TRUE # "FLAGS, frequently mutated genes in public exomes"
                )
-Reading
-Validating
-Removing 20 FLAG genes
-Silent variants: 30668 
-Summarizing
-Processing clinical data
-Finished in 2.990s elapsed (2.940s cpu) 
Tip

Setting rmFlags = TRUE removes frequently mutated genes in public exomes

Details on these genes can be found here.

Data cleaning: setting levels for factor variables
# RECIST response (CR, PR, SD, PD)
laml@clinical.data$response <- factor(laml@clinical.data$response, levels=c("CR","PR","SD","PD"))
#table(laml@clinical.data$response)

# Detailed response including durable and progressive SD
laml@clinical.data$response2 <- factor(laml@clinical.data$response2, levels=c("CR","PR","SD_CB","SD_NCB","PD"))
# table(laml@clinical.data$response2)

# Response by treatment
laml@clinical.data$response_tx <- factor(laml@clinical.data$response_tx, levels=c("Nivo_CB","Combo_CB","Nivo_NCB","Combo_NCB"))
#table(laml@clinical.data$response_tx)

# Race (White, Non-White, and Not Reported)
laml@clinical.data$race2 <- factor(laml@clinical.data$race2, levels=c("W","NW","NR"))
#table(laml@clinical.data$race2)

# Site (dichotomous)
laml@clinical.data$Site2 <- factor(laml@clinical.data$Site2, levels=c("Adnexa","Non_Adnexa"))
#table(laml@clinical.data$Site2)

# Primary/Met
laml@clinical.data$Primary_Met <- factor(laml@clinical.data$Primary_Met, levels=c("P","M"))
#table(laml@clinical.data$Primary_Met)

Subsetting datasets

We have some normal samples with no matched tumor, and we may be interested in doing subtype-specific analysis.

Let’s subset to only tumor samples, and then create further subsetted datasets based on clinical characteristics.

Only tumor samples

I also annotated with pathogenic germline variants; the only relevant ones were in BRCA1 and BRCA2, so we are subsetting to somatic mutations and pathogenic germline variants.

laml_tum = subsetMaf(maf = laml, clinQuery = "Tissue %in% 'Tu'")
-subsetting by clinical data..
--86 samples meet the clinical query
-Processing clinical data
laml_tum = subsetMaf(maf = laml_tum, query = "vcf_id %in% c('SOMATIC','GERMLINE_PATHOGENIC') ")
-Processing clinical data

Other subsets

We can also subset by tumor histology, treatment, and response.

## Comparing Serous vs others
laml_ser = subsetMaf(maf = laml_tum, clinQuery = "celltype %in% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--71 samples meet the clinical query
-Processing clinical data
# Subsetting the other dataset to "not serous"
`%ni%` <- Negate(`%in%`)
laml_other = subsetMaf(maf = laml_tum, clinQuery = "celltype %ni% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--15 samples meet the clinical query
-Processing clinical data
## Comparing CB vs NCB
laml_CB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'CB'")
-subsetting by clinical data..
--29 samples meet the clinical query
-Processing clinical data
laml_NCB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'NCB'")
-subsetting by clinical data..
--57 samples meet the clinical query
-Processing clinical data
## Subsetting by treatment to see if there are differences
### There shouldn't be any, since treatment assignment was randomized, but is is good to confirm.
laml_nivo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Nivo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data
laml_combo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Combo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data

View the MAF file summary

Here is a basic summary of the MAF file

laml_tum
An object of class  MAF 
                        ID summary    Mean Median
                    <char>  <char>   <num>  <num>
 1:             NCBI_Build  GRCh38      NA     NA
 2:                 Center       .      NA     NA
 3:                Samples      86      NA     NA
 4:                 nGenes    6897      NA     NA
 5:        Frame_Shift_Del     234   2.721    2.0
 6:        Frame_Shift_Ins      43   0.500    0.0
 7:           In_Frame_Del      83   0.965    0.0
 8:           In_Frame_Ins       6   0.070    0.0
 9:      Missense_Mutation    9193 106.895   81.0
10:      Nonsense_Mutation     524   6.093    4.0
11:       Nonstop_Mutation      13   0.151    0.0
12:            Splice_Site     505   5.872    3.0
13: Translation_Start_Site      19   0.221    0.0
14:                  total   10620 123.488   92.5
These are some more summaries you can generate
# I'm commenting them out as they have too much info for the tables to be readable.

# #Shows sample summary.
# getSampleSummary(laml_tum)

# #Shows gene summary.
# getGeneSummary(laml_tum)

# #shows clinical data associated with samples
# getClinicalData(laml_tum)

# #Shows all fields in MAF
# getFields(laml_tum)

Visual Summaries

Summarize the maf file

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.

plotmafSummary(maf = laml_tum, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

#Use mafbarplot for a minimal barplot of mutated genes.
mafbarplot(maf =  laml_tum)

Summarize the other subsets

Here are the mutation distributions for the other subsets.

As a sanity check, the majority of Serous samples have TP53 mutations, while there are few TP53 mutations in the top 10 for the non-Serous samples (as is expected).

By Histology

Serous

## Serous 
plotmafSummary(maf = laml_ser, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_ser)

Non-Serous

## Non-serous
plotmafSummary(maf = laml_other, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_other)

By Response

CB

## CB
plotmafSummary(maf = laml_CB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_CB)

NCB

## NCB
plotmafSummary(maf = laml_NCB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_NCB)

By Treatment

Nivo

## Nivo
plotmafSummary(maf = laml_nivo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_nivo)

Combo

## Combo
plotmafSummary(maf = laml_combo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_combo)

Oncoprints

Oncoprints (or “oncoplots” as the wrapper function in maftools is called) summarize complex genomic features of a given dataset.

#oncoplot for top ten mutated genes.
oncoplot(maf = laml_tum, top = 10)

These are the key features and how they are interpreted:

  • Columns represent individual patients.
    • Reading column-wise, you can see the collection of alterations in a patient’s tumor for the given set of genes.
    • The bar plot on the top is a count of tumor mutation burden per patient, color-coded by mutation type.
  • Rows are altered genes.
    • Alterations are color-coded to indicate their type (e.g. mutation, deletion, insertion)
    • Genes can be further annotated with key features (e.g. high impact/likely pathogenic mutations, germline variants, etc.)
  • The bar plot on the right summarizes alterations in a given gene for the entire dataset.
  • Many more features and annotations can be added to further characterize the dataset.
# Highlight by specific attribute in MAF file
oncoplot(maf = laml_tum, 
         additionalFeature = c("IMPACT", "HIGH"))

# Add transitions/transversions  
oncoplot(maf = laml_tum, draw_titv = TRUE)

# cBioPortal alteration nomenclature 
oncoplot(maf = laml_tum, cBioPortal = TRUE)

#KelseyNote

Something to be aware of, especially as there are a number of mucins that are coming up, is that some samples were tumor-only and some were matched normal; would be good to annotate the metadata with this info and see if the mucins are coming up more in the tumor-only samples, since these genes tend to be highly polymorphic.

Oncoprints by clinical data

## Primary vs metastatic sites
oncoplot(maf = laml_tum, clinicalFeatures = 'Primary_Met', sortByAnnotation = TRUE)

## Cell type
oncoplot(maf = laml_tum, clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

## ICI response
oncoplot(maf = laml_tum, clinicalFeatures = c('responseCB','response2'), sortByAnnotation = TRUE)

## ICI response and treatment
oncoplot(maf = laml_tum, clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

## Mutational signature
oncoplot(maf = laml_tum, clinicalFeatures = 'Signature', sortByAnnotation = TRUE)

## Race
oncoplot(maf = laml_tum, clinicalFeatures = 'race', sortByAnnotation = TRUE)

## Site
oncoplot(maf = laml_tum, clinicalFeatures = c('Site2','Site'), sortByAnnotation = TRUE)

## Stratified by histology
## Serous
oncoplot(maf = laml_ser, clinicalFeatures = c('Site2','Site','celltype'), sortByAnnotation = TRUE)

## Non-Serous
oncoplot(maf = laml_other, clinicalFeatures = c('Site2','Site','celltype'), sortByAnnotation = TRUE)

Oncogenic signaling pathways

sigpw plots enrichment for known oncogenic signaling pathways as defined in TCGA, Sanchez/Vega et al.

# Plot genes in 2 top oncogenic signaling pathways
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 2)

# Collapse to just plot the pathways, now top 5
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

Pathways by Response

# Response
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# Response by treatment
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

Pathways by Histology

Since these are the top pathways for the whole cohort, they may be masking histologically-specific pathways.

This is where it is useful to subset the data. Here I am plotting the top pathways for serous and all non-serous patients separately.

# Serous
oncoplot(maf = laml_ser, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous 
oncoplot(maf = laml_other, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Biological processes of known tumor drivers

smgbp plots enrichment for pan-cancer significantly mutated genes, classified by biological processes, per Bailey et al.

Here are the top 2 biological processes of known drivers

# Note that I have highlighted the BRCA1/2 germline variants
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2, 
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by Response & Histology

# Response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Detailed response (note that none of the germline variants had CR or PR)
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response2', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Response by treatment 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by mutational signature*

oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'Signature', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

#KelseyNote

*Will need to return to this, but it seems like sig1 is more BRCA1 while sig2 is more BRCA2.

I need to re-call the mutational signatures (downstream of these analyses in my original script) and update this section.

Top pathways

Now collapsing the plots to show only the pathways

# Top 5 pathways 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

# Top 10 pathways*
# *I've set `topPathways` = 10 but this is 24 pathways, not sure what's going on there...
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE)

# By response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# By response and treatment
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

# By histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

Top pathways by histology

As with the oncogenic signaling pathways, I am plotting these separately for serous and non-serous.

# Serous
oncoplot(maf = laml_ser, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous
oncoplot(maf = laml_other, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10,
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Additional oncoprints with more complex annotations

Code
## Color coding
ghibli_colors <- ghibli_palette("PonyoMedium", type = "discrete")
# ghibli_colors
respcolors <- c("#278B9AFF","#E75B64FF")
names(respcolors) = c("CB","NCB")
fabcolors = RColorBrewer::brewer.pal(n = 5,name = 'Spectral')
names(fabcolors) = c("CR", "PR", "SD_CB", "SD_NCB", "PD")
txcolors = c("#BEAED4","#7FC97F")
names(txcolors) = c("Nivo","Combo")
cellcolors = RColorBrewer::brewer.pal(n = 6,name = 'PRGn')
names(cellcolors) = c("Serous_Adenocarcinoma","Endometrioid_Adenocarcinoma","Adenocarcinoma",
                      "Clear_Cell","Mixed_Epithelial","Undifferentiated_Carcinoma")


anno_cols = list(tx = txcolors, response2 = fabcolors, Survtime = "Blues", celltype = cellcolors)
anno_cols2 = list(tx = txcolors, responseCB = respcolors, Survtime = "Blues", celltype = cellcolors)
#print(anno_cols2)
oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('response2','tx',  'Survtime','celltype'),
  sortByAnnotation = TRUE,
  annotationColor = anno_cols
)

oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('responseCB','tx','Survtime','celltype'),
  sortByAnnotation = TRUE,
  draw_titv = TRUE,
  annotationColor = anno_cols2,
  pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 2,
  additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC")
)

Other Visualizations

Transitions and transversions

Transitions are changes between two purines (A <-> G) or two pyrimidines (C <-> T), involving bases of simliar shapes.

Transversions are changes from a purine to a pyrimidine or vice versa, resulting in the exchange of one-ring and two-ring structures.

Here’s a nice picture from this page:

Transitions tend to be more common than transversions, despite there being twice as many possible transversions.

The titiv function summarizes these for the dataset:

laml.titv = titv(maf = laml_tum, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv, plotNotch = TRUE)

Here are the plots separated by histology:

# Serous
laml.titv.ser = titv(maf = laml_ser, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv.ser, plotNotch = TRUE)

# Non-Serous
laml.titv.other = titv(maf = laml_other, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv.other)

Lollipop plots

Draws a lollipop plot of amino acid changes on a protein structure (protein domains are derived from the PFAM database).

Let’s look at the plot for TP53 (the most mutated gene in overall ovarian cohort)

# Lollipop plot for TP53
lollipopPlot(
  maf = laml_tum,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

However, we know it’s the most commonly mutated gene because the majority of the samples are Serous.

This is what it looks like stratified by histology:

# Serous
lollipopPlot(
  maf = laml_ser,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

# Non-Serous
lollipopPlot(
  maf = laml_other,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

What about ARID1A? Here is the full cohort:

# Lollipop plot for ARID1A
lollipopPlot(
  maf = laml_tum,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

And stratified by histology:

# Serous
lollipopPlot(
  maf = laml_ser,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

# Non-Serous
lollipopPlot(
  maf = laml_other,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

#KelseyNote

I need to better understand what specifically is being plotted here; does this mean that all the mutations are found in introns and not actually protein-coding regions?

These regions don’t seem to map exactly to the exons when I look at ARID1A in UCSC Genome Browser.

Rainfall plots

Rainfall plots illustrate “hypermutated” genomic regions.

Setting detectChangePoints = TRUE detects genomic change points where potential kataegis are found.

Kataegis are genomic segments containing six or more consecutive mutations with an average inter-mutation distance of <=1,000 bp.

From the maftools help page for rainfallPlot():

Kategis detection algorithm by Moritz Goretzky at WWU Munster, which exploits the definition of Kategis (six consecutive mutations with an avg. distance of 1000 bp ) to identify hyper mutated genomic loci.

  • Algorithm starts with a double-ended queue to which six consecutive mutations are added and their average inter-mutation distance is calculated.
  • If the average inter-mutation distance is larger than 1000, one element is added at the back of the queue and one is removed from the front.
  • If the average inter-mutation distance is less or equal to 1000, further mutations are added until the average inter-mutation distance is larger than 1000.
  • After that, all mutations in the double-ended queue are written into output as one kataegis and the double-ended queue is reinitialized with six mutations.

By default, it plots the most mutated sample.

rainfallPlot(
  maf = laml_tum, 
  detectChangePoints = TRUE, 
  pointSize = 0.4,
  ref.build = "hg38"
)
Processing GADCHA..
No changepoints detected!

# Here's a sample with kataegis
rainfallPlot(
  maf = laml_tum, 
  detectChangePoints = TRUE, 
  pointSize = 0.4,
   tsb = "GADGAE",
  ref.build = "hg38"
 )
Processing GADGAE..
Kataegis detected at:
   Chromosome Start_Position End_Position nMuts Avg_intermutation_dist  Size
        <num>          <num>        <num> <int>                  <num> <num>
1:         11        1179999      1182583     6                  516.8  2584
   Tumor_Sample_Barcode   C>G   C>T   T>C
                 <fctr> <int> <int> <int>
1:               GADGAE     1     3     2

#KelseyNote

I need to think of a good way to evaluate these for all samples; maybe just a script for only this for each sample and then summarize?

Think about this…

Compare TMB to TCGA

## Capture size for Twist Bioscience Human Comprehensive Exome kit (used for WES) is 36.8Mb
laml.mutload = tcgaCompare(maf = laml_tum, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)

# Just in serous
laml.mutload = tcgaCompare(maf = laml_ser, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)

# Other
laml.mutload = tcgaCompare(maf = laml_other, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)

#KelseyNote

This is much better than when I used the Genewiz analysis (where they likely used a common reference genome and called everything outside that a somatic mutation), but this still seems a bit high compared to the ovarian cohort.

I know most of the TCGA samples are primaries, but I guess most of these are too, so it should be fairly comparable. Need to consider how much of a problem this is (or isn’t).

Analysis

Somatic interactions

Mutually exclusive or co-occurring set of genes can be detected using the somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes.

# Exclusive/co-occurance event analysis on top 10 mutated genes
## Full cohort
somaticInteractions(maf = laml_tum, top = 25, pvalue = c(0.05, 0.1))

         gene1     gene2       pValue oddsRatio    00    01    11    10
        <char>    <char>        <num>     <num> <int> <int> <int> <int>
  1: GOLGA6L10     MUC3A 1.488870e-09       Inf    78     1     7     0
  2:  PRAMEF18    GTPBP6 5.241201e-08 197.81731    76     1     7     2
  3:      TP53  PRAMEF18 1.019193e-05   0.00000    18     9     0    59
  4:    GTPBP6 GOLGA6L10 3.170550e-05  54.07523    76     2     5     3
  5:    GTPBP6  GOLGA6L9 3.170550e-05  54.07523    76     2     5     3
 ---                                                                   
296:    MYO18B GOLGA6L10 1.000000e+00   0.00000    72     7     0     7
297:     TENM3 GOLGA6L10 1.000000e+00   0.00000    72     7     0     7
298:    MYO18B  GOLGA6L9 1.000000e+00   0.00000    72     7     0     7
299:     SVEP1  GOLGA6L9 1.000000e+00   0.00000    72     7     0     7
300:     TENM3  GOLGA6L9 1.000000e+00   0.00000    72     7     0     7
             pAdj              Event              pair event_ratio
            <num>             <char>            <char>      <char>
  1: 3.446459e-08       Co_Occurence  GOLGA6L10, MUC3A         7/1
  2: 1.129569e-06       Co_Occurence  GTPBP6, PRAMEF18         7/3
  3: 2.054825e-04 Mutually_Exclusive    PRAMEF18, TP53        0/68
  4: 5.661696e-04       Co_Occurence GOLGA6L10, GTPBP6         5/5
  5: 5.661696e-04       Co_Occurence  GOLGA6L9, GTPBP6         5/5
 ---                                                              
296: 1.000000e+00 Mutually_Exclusive GOLGA6L10, MYO18B        0/14
297: 1.000000e+00 Mutually_Exclusive  GOLGA6L10, TENM3        0/14
298: 1.000000e+00 Mutually_Exclusive  GOLGA6L9, MYO18B        0/14
299: 1.000000e+00 Mutually_Exclusive   GOLGA6L9, SVEP1        0/14
300: 1.000000e+00 Mutually_Exclusive   GOLGA6L9, TENM3        0/14
## Serous only
somaticInteractions(maf = laml_ser, top = 25, pvalue = c(0.05, 0.1))

      gene1  gene2       pValue oddsRatio    00    01    11    10        pAdj
     <char> <char>        <num>     <num> <int> <int> <int> <int>       <num>
  1:   TP53 LILRB3 4.838743e-06   0.00000     8     7     0    56 0.000112008
  2: MYO18B  DNAH1 2.205783e-04  51.73280    63     2     4     2 0.004753842
  3: PLXNB2    EYS 6.320328e-03  18.60432    62     3     3     3 0.119703175
  4:  RIMS1 PLXNB2 6.320328e-03  18.60432    62     3     3     3 0.119703175
  5:  HMCN2 LOXHD1 1.068399e-02  14.02065    61     4     3     3 0.180472880
 ---                                                                         
296:   PEG3  MYOM2 1.000000e+00   0.00000    59     6     0     6 1.000000000
297: PLXNB2  MYOM2 1.000000e+00   0.00000    59     6     0     6 1.000000000
298:  RIMS1  MYOM2 1.000000e+00   0.00000    59     6     0     6 1.000000000
299: PLXNB2   PEG3 1.000000e+00   0.00000    59     6     0     6 1.000000000
300:  RIMS1   PEG3 1.000000e+00   0.00000    59     6     0     6 1.000000000
                  Event          pair event_ratio
                 <char>        <char>      <char>
  1: Mutually_Exclusive  LILRB3, TP53        0/63
  2:       Co_Occurence DNAH1, MYO18B         4/4
  3:       Co_Occurence   EYS, PLXNB2         3/6
  4:       Co_Occurence PLXNB2, RIMS1         3/6
  5:       Co_Occurence HMCN2, LOXHD1         3/7
 ---                                             
296: Mutually_Exclusive   MYOM2, PEG3        0/12
297: Mutually_Exclusive MYOM2, PLXNB2        0/12
298: Mutually_Exclusive  MYOM2, RIMS1        0/12
299: Mutually_Exclusive  PEG3, PLXNB2        0/12
300: Mutually_Exclusive   PEG3, RIMS1        0/12
## Non-Serous
somaticInteractions(maf = laml_other, top = 25, pvalue = c(0.05, 0.1))

        gene1     gene2       pValue oddsRatio    00    01    11    10
       <char>    <char>        <num>     <num> <int> <int> <int> <int>
  1:   MUC5AC    GTPBP6 0.0003330003       Inf    10     0     5     0
  2: PRAMEF18  GOLGA6L9 0.0007326007       Inf    11     0     4     0
  3:     TP53     BIRC6 0.0021978022       Inf    12     0     3     0
  4:    ZBED3 GOLGA6L10 0.0021978022       Inf    12     0     3     0
  5: GOLGA6L9    GTPBP6 0.0036630037       Inf    10     1     4     0
 ---                                                                  
296:  CCDC187     ABCA4 1.0000000000         0    11     2     0     2
297:  CCDC187      ABL2 1.0000000000         0    11     2     0     2
298:   HECTD4   CCDC187 1.0000000000         0    11     2     0     2
299:   IFT140   CCDC187 1.0000000000         0    11     2     0     2
300:     SLX4   CCDC187 1.0000000000         0    11     2     0     2
           pAdj              Event               pair event_ratio
          <num>             <char>             <char>      <char>
  1: 0.03468753       Co_Occurence     GTPBP6, MUC5AC         5/0
  2: 0.03815629       Co_Occurence GOLGA6L9, PRAMEF18         4/0
  3: 0.05283178       Co_Occurence        BIRC6, TP53         3/0
  4: 0.05283178       Co_Occurence   GOLGA6L10, ZBED3         3/0
  5: 0.06733463       Co_Occurence   GOLGA6L9, GTPBP6         4/1
 ---                                                             
296: 1.00000000 Mutually_Exclusive     ABCA4, CCDC187         0/4
297: 1.00000000 Mutually_Exclusive      ABL2, CCDC187         0/4
298: 1.00000000 Mutually_Exclusive    CCDC187, HECTD4         0/4
299: 1.00000000 Mutually_Exclusive    CCDC187, IFT140         0/4
300: 1.00000000 Mutually_Exclusive      CCDC187, SLX4         0/4
## CB
somaticInteractions(maf = laml_CB, top = 25, pvalue = c(0.05, 0.1))

       gene1     gene2      pValue oddsRatio    00    01    11    10       pAdj
      <char>    <char>       <num>     <num> <int> <int> <int> <int>      <num>
  1: ADPRHL1    ADGRG4 0.002736727       Inf    24     2     3     0 0.06335016
  2:   OPLAH   DYNC2H1 0.004252453   46.9969    24     1     3     1 0.08053887
  3:  MUC5AC     HMCN2 0.004252453   46.9969    24     1     3     1 0.08053887
  4:    NRDC     HMCN2 0.004252453   46.9969    24     1     3     1 0.08053887
  5:  ARID1A    IFT140 0.005473454       Inf    23     0     3     3 0.09774025
 ---                                                                           
296:   LRP1B    MUC5AC 1.000000000    0.0000    22     4     0     3 1.00000000
297:   ZFHX4      NRDC 1.000000000    0.0000    21     4     0     4 1.00000000
298:   ABCA4      NRDC 1.000000000    0.0000    22     4     0     3 1.00000000
299:   ZFHX4 SPATA31H1 1.000000000    0.0000    21     4     0     4 1.00000000
300:   ITPR3 SPATA31H1 1.000000000    0.0000    22     4     0     3 1.00000000
                  Event             pair event_ratio
                 <char>           <char>      <char>
  1:       Co_Occurence  ADGRG4, ADPRHL1         3/2
  2:       Co_Occurence   DYNC2H1, OPLAH         3/2
  3:       Co_Occurence    HMCN2, MUC5AC         3/2
  4:       Co_Occurence      HMCN2, NRDC         3/2
  5:       Co_Occurence   ARID1A, IFT140         3/3
 ---                                                
296: Mutually_Exclusive    LRP1B, MUC5AC         0/7
297: Mutually_Exclusive      NRDC, ZFHX4         0/8
298: Mutually_Exclusive      ABCA4, NRDC         0/7
299: Mutually_Exclusive SPATA31H1, ZFHX4         0/8
300: Mutually_Exclusive ITPR3, SPATA31H1         0/7
## NCB
somaticInteractions(maf = laml_NCB, top = 25, pvalue = c(0.05, 0.1))

        gene1    gene2       pValue oddsRatio    00    01    11    10
       <char>   <char>        <num>     <num> <int> <int> <int> <int>
  1:   LILRA6   GTPBP6 2.388284e-07       Inf    52     0     5     0
  2:   GTPBP6 PRAMEF18 5.015397e-06       Inf    50     2     5     0
  3:   LILRA6 PRAMEF18 5.015397e-06       Inf    50     2     5     0
  4:   LILRB3     TP53 7.623404e-05   0.00000    12    37     0     8
  5: PRAMEF18   LILRB3 2.543101e-04  33.58248    47     3     5     2
 ---                                                                 
296:     MTOR   LILRA6 1.000000e+00   0.00000    47     5     0     5
297:   PLXNB2   LILRA6 1.000000e+00   0.00000    47     5     0     5
298:   SPTBN4   LILRA6 1.000000e+00   0.00000    47     5     0     5
299:   PLXNB2     MTOR 1.000000e+00   0.00000    47     5     0     5
300:   SPTBN4     MTOR 1.000000e+00   0.00000    47     5     0     5
             pAdj              Event             pair event_ratio
            <num>             <char>           <char>      <char>
  1: 5.528436e-06       Co_Occurence   GTPBP6, LILRA6         5/0
  2: 1.011169e-04       Co_Occurence GTPBP6, PRAMEF18         5/2
  3: 1.011169e-04       Co_Occurence LILRA6, PRAMEF18         5/2
  4: 1.443826e-03 Mutually_Exclusive     LILRB3, TP53        0/45
  5: 4.541253e-03       Co_Occurence LILRB3, PRAMEF18         5/5
 ---                                                             
296: 1.000000e+00 Mutually_Exclusive     LILRA6, MTOR        0/10
297: 1.000000e+00 Mutually_Exclusive   LILRA6, PLXNB2        0/10
298: 1.000000e+00 Mutually_Exclusive   LILRA6, SPTBN4        0/10
299: 1.000000e+00 Mutually_Exclusive     MTOR, PLXNB2        0/10
300: 1.000000e+00 Mutually_Exclusive     MTOR, SPTBN4        0/10
#KelseyNote

I’m not sure how meaningful this analysis is.

Something to be cautious of, especially in the CB/NCB analysis, is that the cohort is enriched for Serous histology but that may not be evenly distributed between CB/NCB groups (should test this), so we could be picking up on histological differences that we mistake for real response signals.